Monte Carlo Methods for Top-k Personalized PageRank Lists and Name Disambiguation

Authors

  • Konstantin Avrachenkov
  • Nelly Litvak
  • Danil Nemirovsky
  • Elena Smirnova
  • Marina Sokol
Abstract

We study the problem of quickly detecting top-k Personalized PageRank lists. This problem has a number of important applications, such as finding local cuts in large graphs, estimating similarity distance, and name disambiguation. In particular, we apply our results to construct efficient algorithms for the person name disambiguation problem.

We argue that two observations are important when finding top-k Personalized PageRank lists. Firstly, it is crucial to detect the top-k most important neighbours of a node quickly, while the exact order within the top-k list and the exact PageRank values matter far less. Secondly, a small number of wrong elements in a top-k list does not really degrade its quality, and tolerating them can lead to significant computational savings.

Based on these two key observations, we propose Monte Carlo methods for fast detection of top-k Personalized PageRank lists. We provide a performance evaluation of the proposed methods and supply stopping criteria. We then apply the methods to the person name disambiguation problem. The resulting algorithm achieved second place in the WePS 2010 competition.

Key-words: Personalized PageRank, Monte Carlo Methods, Person Name Disambiguation

∗ INRIA Sophia Antipolis-Méditerranée, France
† University of Twente, The Netherlands
‡ INRIA Sophia Antipolis-Méditerranée, France
§ INRIA Sophia Antipolis-Méditerranée, France
¶ INRIA Sophia Antipolis-Méditerranée, France
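The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common Monte Carlo approach to top-k Personalized PageRank: run many short random walks from the seed node, count visits, and keep the k most visited nodes. The function name `top_k_ppr`, the adjacency-dict graph representation, the fixed `num_walks` budget (used here in place of the paper's stopping criteria), and the choice to end a walk at a dangling node are all assumptions made for illustration, not the authors' implementation.

```python
import random
from collections import Counter


def top_k_ppr(graph, seed, k, num_walks=10_000, damping=0.85, rng=None):
    """Estimate the top-k Personalized PageRank list of `seed`.

    graph: dict mapping a node to a list of its out-neighbours (assumed format).
    Each walk starts at `seed`; at every step it continues to a random
    out-neighbour with probability `damping` and stops otherwise.
    Normalised visit counts serve as the PageRank estimates.
    """
    rng = rng or random.Random()
    visits = Counter()
    for _ in range(num_walks):
        node = seed
        while True:
            visits[node] += 1
            neighbours = graph.get(node, [])
            # Assumption: a walk also stops when it reaches a dangling node.
            if not neighbours or rng.random() > damping:
                break
            node = rng.choice(neighbours)
    total = sum(visits.values())
    # most_common(k) returns the k most visited nodes, highest count first.
    return [(node, count / total) for node, count in visits.most_common(k)]


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"], "d": ["a"]}
    for node, score in top_k_ppr(toy_graph, "a", k=3, rng=random.Random(0)):
        print(node, round(score, 3))
```

Because only membership in the top-k list matters, a sketch like this can stop well before the individual scores have converged; the fixed walk budget above is merely the simplest stand-in for a proper stopping rule.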


Similar resources

Quick Detection of Top-k Personalized PageRank Lists

We study a problem of quick detection of top-k Personalized PageRank (PPR) lists. This problem has a number of important applications such as finding local cuts in large graphs, estimation of similarity distance and person name disambiguation. We argue that two observations are important when finding top-k PPR lists. Firstly, it is crucial that we detect fast the top-k most important neighbors ...


Fast Incremental and Personalized PageRank

In this paper, we analyze the efficiency of Monte Carlo methods for incremental computation of PageRank, personalized PageRank, and similar random walk based methods (with focus on SALSA), on large-scale dynamically evolving social networks. We assume that the graph of friendships is stored in distributed shared memory, as is the case for large social networks such as Twitter. For global PageRa...


Efficient Algorithms for Personalized PageRank (doctoral dissertation, Department of Computer Science, Stanford University)

We present new, more efficient algorithms for estimating random walk scores such as Personalized PageRank from a given source node to one or several target nodes. These scores are useful for personalized search and recommendations on networks including social networks, user-item networks, and the web. Past work has proposed using Monte Carlo or using linear algebra to estimate scores from a sin...


Personalized Hitting Time for Informative Trust Mechanisms Despite Sybils

Informative and scalable trust mechanisms that are robust to manipulation by strategic agents are a critical component of multi-agent systems. While the global hitting time mechanism (GHT) introduced by Hopcroft and Sheldon [9] is more robust to manipulation than PageRank, strategic agents can still benefit significantly under GHT by performing sybil attacks. In this paper, we introduce the per...


UBC Entity Discovery and Linking & Diagnostic Entity Linking

This paper describes the runs submitted by the UBC team at TAC-KBP 2014 for both the English Entity Discovery and Linking (EDL) and Diagnostic Entity Linking (DEL) tasks. Our main interest was to compare the performance of two totally different named entity recognizer systems and to combine them with three different named entity disambiguation systems that were developed for the TAC-KBP 2013 EL ta...



Journal:
  • CoRR

Volume: abs/1008.3775

Publication date: 2010